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Abstract 

In this work we show that iterative threshoiding and K means (ITKM) algorithms can recover a generating dictionary 
with K atoms from noisy S sparse signals up to an error e as long as the initialisation is within a convergence radius, that 
is up to a log K factor inversely proportional to the dynamic range of the signals, and the sample size is proportional to 
K\ogKe~^. The results are valid for arbitrary target errors if the sparsity level is of the order of the square root of the 
signal dimension d and for target errors down to K~‘ if S scales as S < d/{eiogK). 

Index Terms 

dictionary learning, sparse coding, sparse component analysis, sample complexity, convergence radius, alternating 
optimisation, thresholding, K-means 

1 Introduction 

The goal of dictionary learning is to find a dictionary that will sparsely represent a class of signals. That 
is given a set of N training signals G which are stored as columns in a matrix Y = (yi,..., y]\f), one 
wants to find a collection of K normalised vectors (l)k G called atoms, which are stored as columns 
in the dictionary matrix $ = ..., G and coefficients Xn, which are stored as columns in 

the coefficient matrix X = {xi,... ,xn) such that 

Y = and X sparse. (1) 

Research into dictionary learning comes in two flavours corresponding to the two origins of the problem, 
the slightly older one in the independent component analysis (ICA) and blind source separation (BSS) 
community, where dictionary learning is also known as sparse component analysis, and the slightly 
younger one in the signal processing community, where it is also known as sparse coding. The main 
motivation for dictionary learning in the ICA/BSS community comes from the assumption that the 
signals of interest are generated as sparse mixtures - random sparse mixing coefficients Xq - of several 
sources or independent components - the dictionary ^>o - which can be used to describe or explain a 
(natural) phenomenon, HTSl . Il30l . ETi . For instance in the 1996 paper by Olshausen and Field, ESI , 
which is widely regarded as the mother contribution to dictionary learning, the dictionary is learned 
on patches of natural images, and the resulting atoms bear a striking similarity to simple cell receptive 
fields in the visual cortex. A natural question in this context is, when the generating dictionary $0 can 
be identified from Y, that is, the sources from the mixtures. Therefore the first theoretical insights into 
dictionary learning came from this community, IITSl . Also the first dictionary recovery algorithms with 
global success guarantees, which are based on finding overlapping clusters in a graph derived from the 
signal correlation matrix Y*Y, take the ICA/BSS point of view, |3, O. 

The main motivation for dictionary learning in the signal processing community is that sparse signals are 
immensely practical, as they can be easily stored, denoised, or reconstructed from incomplete information, 
lUS/ [ 33 J, II 3 TI . Thus the interest is less in the dictionary itself but in the fact that it will provide sparse 
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representations X. Following the rule 'the sparser - the better' the obvious next step is to look for the 
dictionary that provides the sparsest representations. So given a budget of K atoms and S non-zero 
coefficients per signal, one way to concretise the abstract formulation of the dictionary learning problem 
in Q is to formulate it as optimisation problem, such as 

{P 2 ,s) min ||y — <l>X||i7’ s.t. ||xn||o < S' and G P, (2) 

where || • ||o counts the nonzero elements of a vector or matrix and V is defined as 2? = {$ = ..., : 

\\(l>k \\2 = 1}. While {P 2 ,s) is for instance the starting point for the MOD or K-SVD algorithms, IT^ . 1^ , other 
definitions of optimally sparse lead to other optimisation problems and algorithms, IHl, ||3Zl, BHl, Il32l . 
Il43l . Il38l . The main challenge of optimisation programmes for dictionary learning is finding the global 
optimum, which is hard because the constraint manifold V is not convex and the objective function is 
invariant under sign changes and permutations of the dictionary atoms with corresponding sign changes 
and permutations of the coefficient rows. In other words for every local optimum there are 2^K\ — 1 
equivalent local optima. 

So while in the signal processing setting there is a priori no concept of a generating dictionary, it is often 
used as auxiliary assumption to get theoretical insights into the optimisation problem. Indeed without 
the assumption that the signals are sparse in some dictionary the optimisation formulation makes little 
or no sense. For instance if the signals are uniformly distributed on the sphere in in asymptotics 
{P 2 ,s) becomes a covering problem and the set of optima is invariant under orthonormal transforms. 
Based on a generating model on the other hand it is possible to gain several theoretical insights. For 
instance, how many training signals are necessary such that the sparse representation properties of a 
dictionary on the training samples (e.g. the optimiser) will extrapolate to the whole class, Il34l , BTl . 
H35l . 1120] . What are the properties of a generating dictionary and the maximum sparsity level of the 
coefficients and signal noise such that this dictionary is a local optimiser or near a local optimiser given 
enough training signals, l2T|, (TZl, lEH, |j40l, 119] . 

An open problem for overcomplete dictionaries with some first results for bases, UMI . H45l , is whether 
there are any spurious optimisers which are not equivalent to the generating dictionary, or if any starting 
point of a descent algorithm will lead to a global optimum? A related question (in case there are spurious 
optima) is, if the generating dictionary is the global optimiser? If yes, it would justify using one of the 
graph clustering algorithms for recovering the optimum, |j6|, (21, 01, 0. This is important since all 
dictionary learning algorithms with global success guarantees are computationally very costly, while 
optimisation approaches are locally very efficient and robust to noise. Knowledge of the convergence 
properties of a descent algorithm, such as convergence radius (basin of attraction), rate or limiting 
precision based on the number of training signals, therefore helps to decide when it should take over 
from a global algorithm for fast local refinement, (H. 

In this paper we will investigate the convergence properties of two iterative thresholding and K-means 
algorithms. The first algorithm ITKsM, which uses signed signal means, originates from the response 
maximisation principle introduced in |40|. There it is shown that a generating p-coherent dictionary 
constitutes a local maximum of the response principle as long as the sparsity level of the signals scales as 
S = 0{p~^). It further contains the first results showing that the maximiser remains close to the generator 
for sparsity levels up to 5 = 0(/x“^/logiT). For a target recovery error e the sample complexity N is 
shown to scale as N = 0{SK^e~‘^) and the basin of attraction is conjectured to be of size 0(1/\^S). 
Here we will not only improve on the conjecture by showing that in its online version the algorithm has 
a convergence radius of size 0(l/\/logiT) but also show that for the algorithm rather than the principle 
the sample complexity reduces to A = 0{K log Ke~‘^\og{£~^)) (omitting log log factors). Again recovery 
to arbitrary precision holds for sparsity levels S = 0{p~^) and stable recovery up to an error K~^ for 
sparsity levels S = 0(p“^/(21ogAT)). We also show that the computational complexity assuming an 
initialisation within the convergence radius scales as 0{log{e~^)dKN) or omitting log factors 0{dK^e~^). 
Motivated by the desire to reduce the sample complexity for the case of exactly sparse, noiseless signals, 
we then introduce a second iterative thresholding and K-means algorithms ITKrM, which uses residual 
instead of signal means. It has roughly the same properties as ITKsM apart from the convergence radius 
which reduces to 0(1/^/S) and the computational complexity, which scales as 0{dN{K + 5^)) and thus 
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can go up to 0{d?NK) for S = 0{d). However, if S' = and the signals follow an exactly sparse, 

noiseless model, we can show that the sample complexity reduces to log(e“^)) (omitting 

log log factors). Our results are in the same spirit as the results for the alternating minimisation algorithm 
in m but have the advantage that they are valid for more general coefficient distributions and a lower 
level of sparsity (S larger) resp. higher level of coherence, that the convergence radius is larger and that 
the algorithms exhibit a lower computational complexity They are also close to some very recent results 
about several alternating minimisation algorithms, which are like the ITKMs based on thresholding, 
Il5| . Compared to our results they are essentially the same in terms of convergence radius and sample 
complexity but are only valid for sparsity levels S = 0(p“^) and up to a limiting precision (even in the 
exact sparse noiseless case). More interestingly ||5l contains a strategy for finding initialisations within a 
radius 0(1/ logit') to the generating dictionary, which is proven to succeed for sparsity levels S = 0(/i“^). 
With slight modifications and using Tropp's results on average isometry constants, Il4^ . this initialisation 
strategy could probably be proven to work also for sparsity levels up to 5 = 0{/j~‘^/{£logK)). However, 
its computational complexity seems to explode as S grows. 

The rest of the paper is organised as follows. After summarising notation and conventions in the following 
section, in Section we re-introduce the ITKsM algorithm, discuss our sparse signal model and analyse 
the convergence properties of ITKsM. Based on the shortcomings of ITKsM we motivate the ITKrM 
algorithm in Section]^ and again analyse its convergence properties. In Section]^ we provide numerical 
simulations indicating that the convergence radius of both ITKM algorithms is generically much larger 
and that sometimes ITKrM even converges globally from random initialisations. Finally in Section]^ we 
compare our results to existing work and point out future directions of research. 

2 Notations and Conventions 

Before we join the melee, we collect some definitions and lose a few words on notations; usually 
subscripted letters will denote vectors with the exception of e, a, ui, where they are numbers, eg. Xn G 
vs. Ek G M, however, it should always be clear from the context what we are dealing with. 

For a matrix M, we denote its (conjugate) transpose by M* and its Moore-Penrose pseudo inverse by 
Ml. We denote its operator norm by ||M||2,2 = niax|| 3 ,|| 2 =i ||Mx||2 and its Frobenius norm by ||M||i? = 
tr(M*M)^/^, remember that we have ||M|| 2,2 < 

We consider a dictionary <I) a collection of K unit norm vectors (pk £ II'/’fell 2 = 1- By abuse of 
notation we will also refer to the d x K matrix collecting the atoms as its columns as the dictionary, i.e. 

= ((/>i,... (Pk)- The maximal absolute inner product between two different atoms is called the coherence 
/i of a dictionary, p = Kcpk, </>j)|- 

By <F/ we denote the restriction of the dictionary to the atoms indexed by I, i.e. <F/ = {(pi^,... ,cpi^), ij G I, 
and by P{^i) the orthogonal projection onto the span of the atoms indexed by I, i.e. P{^i) = 

Note that in case the atoms indexed by I are linearly independent we have <1>| = (<Fj<l> 7 )“^<l>j. We 
also define Q{^i) the orthogonal projection onto the orthogonal complement of the span on that is 
Qi^i) = Id — Pi^i), where Id is the identity operator (matrix) in 

(Ab)using the language of compressed sensing we define (5/(<F) as the smallest number such that all 
eigenvalues of are included in [1 — (5/(<F),l -|- (i/(<F)] and the isometry constant 5s(‘F) of the 

dictionary as ( 55 (<F) := max| 7|<5 When clear from the context we will usually omit the reference to 

the dictionary. For more details on isometry constants, see for instance lUTl . 

To keep the sub(sub)scripts under control we denote the indicator function of a set V by x(V, •), that is 
x(V, v) is one if v G V and zero else. The set of the first S integers we abbreviate by S = {!,...,5}. 

We define the distance of a dictionary 'F to a dictionary as 

(i(<F, T') := maxmin \\(pk ± V’flb = maxmin \/2 — 2\{(pi~,'ip£)\. (3) 

k i k £ 

Note that this distance is not a metric, since it is not symmetric. For example if is the canonical basis 
and 'F is defined by tpi = cpi for f > 3, V'l = (ei- 1 - 62 ) /a/2, and 'i /2 = Zli (pi/'/d then we have d{^, 'F) = l/\/2 

while (i('F,<F) = J2 — 2l\fd. A symmetric distance between two dictionaries <F,'F could be defined as 




the maximal distance between two corresponding atoms, i.e. 

:= minmax||(/>fc±V’p(fc)|| 2 , (4) 

pGV k 

where V is the set of permutations of {1,..., 5}. Since locally the distances are equivalent we will state 
our results in terms of the easier to calculate asymmetric distance and assume that 'I' is already signed 
and rearranged in a way that '&) = max^ — V’fcib- 

We will make heavy use of the following decomposition of a dictionary 'k into a given dictionary and 
a perturbation dictionary Z. If d{^, <1>) = e we set ||V’A: — (pk \\2 = ^k, where by definition max?; eu = e. We 
can then find unit vectors Zk with {4>k,Zk) = 0 such that 

tjjk = oik(l)k + u}kZk, for, ak:=l-ell2 and Wfc := (e| - 4/4)^ • (5) 

The dictionary Z collects the perturbation vectors on its columns, that is Z = {zi,.. .zk) and we define 
the diagonal matrices Aj, Wj implicitly via 

= ^iAi + ZiWi, ( 6 ) 

or in MATLAB notation Aj = diag(a 7 ) with aj = {ak)k&i and analogue for Wj. Based on this decompo¬ 
sition we further introduce the short hand bk = '^Zk and Bi = ZjWjA'J^. 

We consider a frame F a collection oi K > d vectors fk G for which there exist two positive constants 
A, B such that for all v we have 

K 

A\\v\\l<Y,\{fk,v)? <B\\v\\l (7) 

k=l 

If B can be chosen equal to A, i.e. B = A, the frame is called tight and if all elements of a tight frame have 
unit norm we have B = A = K/d. The operator FF* is called frame operator and by Q its spectrum is 
bounded by A, B. For more details on frames, see e.g. ||12|. 

Finally we introduce the Landau symbols O, o to characterise the growth of a function. We write 

fit) = 0{g{t)) if lim fit)/g{t) = C < oo 

t^O/oo 

and fit) = oigit)) if lim f{s)/gie) = 0. 

t-s-O/oo 

3 Dictionary Learning via ITKsM 

Iterative thresholding and K signal means (ITKsM) for dictionary learning was introduced as algorithm 
to maximise the S'-response criterion 

iPm) max J]]max||T'^y„||i, (8) 

n ' ' 

which for S' = 1 reduces to the K-means criterion, Il40l . It belongs to the class of alternating optimisation 
algorithms for dictionary learning, which alternate between updating the sparse coefficients based on 
the current version of the dictionary and updating the dictionary based on the current version of 
the coefficients, m, a m- As its name suggests, the update of the sparse coefficients is based on 
thresholding while the update of the dictionary is based on K signal means. 

Algorithm 3.1 (ITKsM one iteration). Given an input dictionary and N training signals yn do: 

• For all n find = argmax/.|j |=5 ||T'^y„||i. 

• For all k calculate 

^ ■ S^mi{i^k,yn)) ■xili,n,k)- 

n 

• Output ^ = {'tpi/\\lj;i\\ 2 , ... ,lfK/\\'ipKh)- 


( 9 ) 
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The algorithm can be stopped after a fixed number of iterations or once a stopping criterion, such as 
improvement T') < 0 for some threshold Q, is reached. Its advantages over most other dictionary 
learning algorithms are threefold. First it has very low computational complexity. In each step the most 
costly operation is the calculation of the N matrix vector products that is the matrix product 

of order 0{dKN). In comparison the globally successful graph clustering algorithms need to calculate 
the signal correlation matrix Y*Y, cost 0{dN‘^). 

Second due to its structure only one signal has to be processed at a time. Instead of calculating for 
all n and calculating the sum, one simply calculates l\, ^ for the signal at hand, updates all atoms 'ipk for 
which k £ l\, ^ as Tpk ^ i^k + Vn ■ sign((?/;fc, y„)) and turns to the next signal. Once N signals have been 
processed one does the normalisation step and outputs Further in this online version only {2K + l)d 
values corresponding to the input dictionary, the current version of the updated dictionary and the signal 
at hand, need to be stored rather than the N xd signal matrix. Parallelisation can be achieved in a similar 
way. Again for comparison, the graph clustering algorithms, K-SVD, O, and the alternating minimisation 
algorithm in [ll need to store the whole signal resp. residual matrix as well as the dictionary. 

The third advantage is that with high probability the algorithm converges locally to a generating dictio¬ 
nary <I> assuming that we have enough training signals and that these follow a sparse random model in 
$. In order to prove the corresponding result we next introduce our sparse signal model. 


3.1 Signal Model 

We employ the same signal model, which has already been used for the analyses of the S-response and 
K-SVD principles, (391, llQl- Given a d x K dictionary <1>, we assume that the signals are generated as. 


y = 


T*® -I- r 

\/l + lkll2 


( 10 ) 


where x is drawn from a sign and permutation invariant probability distribution u on the unit sphere 
C and r = (r(l).. .r{d)) is a centred random subgaussian vector with parameter p, that is 
E(r) = 0 and for all vectors v the marginals {v,r) are subgaussian with parameter p, meaning they 
satisfy < e* ^ for all t > 0. We recall that a probability measure u on the unit sphere is sign 

and permutation invariant, if for all measurable sets X C S^~^, for all sign sequences a G { — 1,1}'^ and 
all permutations p we have 


u{aX) = u{X), where aX := {{a{l)x{l),... ,a{K)x{d)) ■. x ^ X} (11) 

v{p{X)) = v{X), where p{X) ■.= {{x{p{l)),..., x(j){K))) ■. x £ X]. (12) 


We can get a simple example of such a measure by taking a positive, non increasing sequence c, that 
is c(l) > c(2) > ... > c{K) > 0, choosing a sign sequence a and a permutation p uniformly at random 
and setting x = with Xp^a{k) = a{k)c{p{k)). Conversely we can factorise any sign and permutation 
invariant measure into a random draw of signs and permutations and a measure on the space of non¬ 
increasing sequences. 

By abuse of notation let c now denote the mapping that assigns to each x G the non increasing 

rearrangement of the absolute values of its components, i.e. c : x ^ Cx with Cx{k) := \x{p{k))\ for a 
permutation p such that |x(p(l))| > |x(p(2))| > ... > \x{p{K))\ > 0. Then the mapping c together with 
the probability measure i/ on x G induces a probability measure Uc on n [0,1]^ via 

the preimage c~^, that is t'c(f^) := z^(c“^(fl)) for any measurable set Q C 
Using this new measure we can rewrite our signal model as 


y = 


^Xc,p,a + r 

\/l + lkll2 


(13) 


where we define Xc^p^aik) = a{k)c{p{k)) for a positive, non-increasing sequence c distributed according to 
Vc, a sign sequence a and a permutation p distributed uniformly at random and r again a centred random 
subgaussian vector with parameter p. Note that we have E(||r|||) < dp^, with equality for instance in the 
case of Gaussian noise. To incorporate sparsity into our signal model we make the following definitions. 
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Definition 3.1. A sign and permutation invariant coejficient distribution v on the unit sphere ^ C is 
called S-sparse with absolute gap /3s > 0 and relative gap As > /3s, if 

u{c,{S)-c,{S + l) <Ps) = 0 and ^ -c^^S + 1) ^ ^ 

or equivalently 

u, (c(5) - c(5 + 1) < / 3 s) = 0 and u, = 0. (15) 

The coefficient distribution is called strongly S-sparse if As > 2pS. 


For exactly sparse signals /3s is simply the smallest non-zero coefficient and As is the inverse dynamic 
range of the non-zero coefficients. We have the bounds /ds < ^ and As < 1. Since equality holds for the 
'flat' distribution generated from c{k) = for k < S and zero else, we will usually think of /3s being of 
the order 0(;^) and As being of the order 0(1). We can also see that coefficient distributions can only 
be strongly 5-sparse as long as S is smaller than that is 5 = 0{p~^) = 0{\/d). 

For the statement of our results we will use three other signal statistics. 


7iA := lEc (c(l) -h ... -1- c{S)) 72,s := lEc (c^(l) -h .. . c^{S)) 



(16) 


The constants 71 ^ and will help characterise the expected size of xl^k- We have 5'/3s < 7 i,s < and 


Cr > 


1 - 

1/1 + 5dp2 ’ 


(17) 


compare Il40] . From the above inequality we can see that Cr captures the expected signal to noise ratio, 
that is for large p we have 


c: 


1 

dp^ 


E(||r||i) • 


(18) 


Similarly the constant 72,5 can be interpreted as the expected energy of the signal approximation using 
the largest S generating coefficients and the generating dictionary, or in other words 1 — 72,5 is a bound 
for the expected energy of the approximation error. 

For noiseless signals generated from the flat distribution described above we have 71^5 = ff~S, (7^ = 1 
and 72,5 = 1, so we will usually think of these constants having the orders 71 ^ = 0{VS), Cr = 0(1) and 
72,s = 0 ( 1 ). 

From the discussion we see that, while being relatively simple, our signal model allows us to capture both 
approximation error and noise. Our results have quite straightforward extensions to more complicated 
(realistic) signal models, which for instance include outliers (normalised but not sign or permutation 
invariant coefficients) or a small portion of coefficients without gap. With somewhat more effort it is also 
possible to relax the assumption of sign and permutation invariance in our coefficient model, potentially at 
the cost of decreasing the admissible sparsity level, the convergence radius and the recovery accuracy and 
increasing the sample complexity. Indeed we will see that the main reason for assuming sign invariance 
is to ensure that when thresholding with the generating dictionary always succeeds in recovering the 
generating support with a large margin and therefore also succeeds with a perturbed dictionary. To a lesser 
degree, especially in the case of ITKrM, the sign invariance also supports the permutation invariance in 
ensuring a richness of signals such that the averaging procedures contract towards the generating atoms. 
In particular the permutation invariance prevents the situation that two atoms are always used together 
and could therefore be replaced by two of their linear combinations. 

However, we will sacrifice generality for comprehensibility and therefore just give pointers in the respec¬ 
tive proofs. 
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3.2 Convergence analysis of ITKsM 

We first look at the more general case of noisy, non exactly S-sparse signals and specialise to noiseless, 
strongly S-sparse signals later. 


Theorem 3.2. Let ^ he a unit norm frame with frame constants A < B and coherence p and assume that the 
training signals Un are generated according to the signal model in (131 with coefficients that are S-sparse with 
absolute gap fds and relative gap As- 
Fix a target error e > 4:e^^p, where 


sk^Vb+t ( _ -^1 _ y 

Cr^i,s ^ V98max{//2^ p2j J 


Given an input dictionary 'k such that 


As 


1J + 


log 


1060 jCyB+l) 

AsCr7i,s 


(19) 


( 20 ) 


then after 6 [log(e ^)] iterations of ITKsM each on afresh batch of N training signals the output dictionary 
satisfies 


d(§, ^) < e 


( 21 ) 


except with probability 

18 riog(e- )lKexp ( 

Before providing the proof let us discuss the result above. We first see that ITKsM will succeed if the 
input dictionary is within a radius 0{As/^/logK) to the generating dictionary <1>. In case of exactly sparse 
signals this means that the convergence radius is up to a log factor inversely proportional to the dynamic 
range of the coefficients. This should not be come as a big surprise, considering that the average success 
of thresholding for sparse recovery with a ground truth dictionary depends on the dynamic range, Il42l . 
It also means that in the best case the convergence radius is actually of size 0(1/-flog K), since for the 
flat distribution Ag = 1 . 

Next note that in the theorem above we have restricted the target error to be larger than 4e^ p. However 
at the cost of unattractively large constants in the probability bound, we can actually reach any target 
error larger than e^^p. 

To highlight the relation between the sparsity level and the minimally achievable error, we specialise 
the result to coefficients drawn from the flat distribution, meaning jdg = 1/'/S- We further assume white 
Gaussian noise with variance = 1/d, corresponding to an expected signal to noise ratio of 1, and an 
incoherent dictionary with p <1/Vd. If S < ^ some £ >2 then the minimally achievable error 

Sp^p can be as small as 0(K‘^~^). 

Last we want to get a feeling for the total number of training signals we need to have a good success 
probability. For exactly S-sparse signals with dynamic coefficient range 1 we have 71^5 = -/S. Omitting 
loglog factors each iteration is therefore likely to be successful when using a batch ot N = 0(K log Ke~‘^) 
training signals, meaning that ITKsM is successful with high probability as soon as the total number of 
training signals used in the algorithms scales as 0(KlogKe~’^log(e~^)). Note that in case of noise due 
to information theoretic arguments the factor seems unavoidable, ||25|. 

To summarise the discussion we provide an O-notation version of the theorem, which is less plug and 
play but free of messy constants and as such better suited to convey the quality of the result. Compare 
also Subsection 13.11 for the O notation conventions. 

Theorem - O ( |3.2| . Assume that in each iteration the number of training signals scales as N = 0(K log Ke~^). 
If S < 0( ^^2iogj^ ) then with high probability for any starting dictionary 'k within distance e < 0(1/-flog K) 
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to the generating dictionary after 0(log(e“^)) iterations of ITKsM, each on afresh batch of training signals, the 
distance of the output dictionary ^ to the generating dictionary will be smaller than 

max |e, O | . ( 22 ) 

Proof: The proof consists of two steps. First we show that with high probability one iteration of 
ITKsM reduces the error by at least a factor k < 1. Then we iteratively apply the results for one iteration. 
Step 1: For the first step we use the following ideas, compare also |j40|: For most sign sequences an and 
therefore most signals 


Vn = - , „ 

vl + ||^n||2 

thresholding with a perturbation of the original dictionary will still recover the generating support In '■= 
Pn^{S), that is ^ = In- Assuming that the generating support is recovered, for each k the expected 
difference of the sum in Q between using the original $ and the perturbation \I' is small, that is smaller 
than (i($, T') = e, and due to concentration of measure also the difference on a finite number of samples 
will be small. Finally for each k the sum in Q will again concentrate around its expectation, a scaled 
version of the atom fk- 
Formally we write. 


'^k = ^ Sign{{'fk,yn)) X{li,n: (^n{k) x{In, k) 


N 


+ (^n{k) X{ln, - ^ (^n{k) xi^n, k)j i^^Vn (^nik) xi^n, k) j 

n \ n / \ n / 


(23) 

(24) 


Since IE (;^ Yhn Vn o'nik) x{kn, k)) = 4>k, see the proof of Lemma B.5 in the appendix, using the triangle 

inequality and the bound ||yn ||2 < VB + 1 we get. 


7 Cryi,s , 
Wk - 


< 


^ XI yn [sign((V'fc, Vn)) x(4,n> k) - an{k) x{In, k)] 


+ 




o I ^ 

< --tt{n : sign((V'fc, Vn)) x(4,n> k) f an{k) x{In, k)} 


K 


N 


+ 




(25) 


Next note that for the draw of pn the event that for a given index k the signal coefficient using thresholding 
with 4^ is different from the oracle signal is contained in the event that thresholding does not recover the 
entire generating support ^ f In or that on the generating support the empirical sign pattern using 4^ 
is different from the generating pattern, sign{{'fk, Pn)) / crn{k) for a k € In, 

{Pn : sign((V^fc, Pn)) xili,n: k) f an{k) x(4, A:)} C {pn : f In} U {pn : sign{^*j^yn) / an{In)}- (26) 

From HOI, e.g. proof of Proposition 7, we know that the right hand side in ( [^ is in turn contained in 
the event £n U iFn, where 


F — 


:= |yn : 3A: s.t. | '^an{j)cn{pnU)){4>j,4>k) > ui or \{rn,4>k)\ > U 2 ^ 


j¥=k 


T — 

'J r\, ' — 


:= |y„ : Sk s.t. Wfe] 'Y^(rn{i)cn{pn{f)){fj,Zk] > uj, or Wk\{rn,Zk)\ > 


(27) 


(28) 


i-yi- 


for 2 (ni + M 2 + ris + ^^ 4 ) < Cn(5') 


Cn{S + 1 ). 


(29) 
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In particular if we choose ui = U 2 = (cn(-S') — Cn{S + l))/7, M 3 = ui — ^ and M 4 = M 3/2 we get that 
En, which contains the event that thresholding using the generating dictionary fails, is independent of 
To estimate the number of signals for which the thresholding summand is different from the oracle 
summand, it suffices to count how often yn £ £n or G Tn, 


tJ{n : sign(('!/’fc, Vn)) k) / an{k) x{In, k)} < tl{M : yn G £n} + tJ{n : yn G Fn}- 

Substituting these bounds into ( |^ we get, 

7 CrXl,S , 

Wk - 


(30) 


^ 2VbTT,,^ . ^ ^ ^ ^ , 


+ 


Vn (Tn{k) Xiln, k) - 4>k 


(31) 


If we want the error between '^fc/UV’fclb and to be of the order ks, we need to ensure that the right 


hand side of pT|) is less than ks ■ 
From Lemma 


B.3 in the appendix we know that 


r (#{n : £ £„} > + <0 j < exp 


(32) 


Next Lemma IB.4I tells us that 


Tro/^ur ^ T- T ^ CrXXsN ^ / ( -tlCrXXsN \ 

Ir njn : G Fn \ > - , • (re + t2) < exp - , - , (33) 

V ^ - 2K^/B^ ^ 7“ V2iT7)BTT(2re + f2)/ 


whenever 


e < 


As 


^ 3 + 7 e(SS?) 


(34) 


Finally by Lemma B.5 we have 
1 


N 




7^ 3“ 


■ (^n{k) ■ x{In, k) - ^Y’^ k>k 


whenever 0 < f 3 < 


n\\2 

VS 

V~B-\-2 


K 


>t3 


Cr7l,S 

K 


< exp 


-nc^iisN , 1 


8SK 


+ 


(35) 


. Thus with high probability we have. 


7 <3*^71,s, 

Wk - 


< —^ + ii + re + i2 + is) • 


(36) 


To be more precise if we choose a target error e > 4e^_p and set fi = e/10/ h = max{e,e}/10, r = 1/10 
and t 3 = e/5, then except with probability 

'-C/7i7iVe2\ 


exp 


/ -CrXl,sF£ 

\120K^/B~+^ 


+ exp 


-CrXi,sNmax{£,e} 


60KVB~+T 


+ 2K exp 


200Si^ 


we have 


max 

k 


7 CrTljS , 

Yk tF rk 


K 


^ Crll,S 3 
< ^ ■ - •max{e,e}. 


By Lemma B.IO this further implies that 

d(^, 4>) = max 


i^k 


- 4>k 


< 0.83max{e, e}. 


(37) 


(38) 


(39) 


Note that in case of outliers we first have to split the sum in (|^ into the outliers, whose number 
concentrates around N the probability of being an outlier, and the inliers for which we can use the 
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same procedure as above, see |[T9l for more details. Similarly the small portion of coefficients without 
(sufficiently) large gap can be included in the small number of signals for which thresholding fails. 

Step 2: From Step 1 we know that in each iteration the error will either be decreased by at least a factor 
0.83 or if its already below e will stay below e. So after L iterations each using a new batch of N signals, 
< max{e, 0.83^d('I', <!>)} < max{e, 0.83^}, except with probability 


L 



f -Cr7i,S^£ \ 

[uoKvlTTi J 


200SK J 


(40) 


Setting L = 6[log(e“^)] and taking into account that the failure probability of each iteration is bounded 
by 3K exp ^ ^ 20 osk^ ) l^^^s to the final estimate. 

One □ 

For most desired precisions Theorem |3.2[ which is valid for a quite large hyper-cube of input dic¬ 
tionaries and a wide range of sparsity levels, will actually be sufficient. However, for completeness we 
specialise the theorem above to the case of strongly S-sparse, noiseless signals and show that in this case 
ITKsM can achieve arbitrarily small errors, provided enough samples. 


Corollary 3.3. Let ^ be a unit norm frame with frame constants A < B and coherence p and assume that the 
training signals pn are generated according to the signal model in with r = 0 and coefficients that are strongly 
S-sparse with relative gap As > 2pS. Fix a target error e >0. If for the input dictionary 4' we have 

i) <-7- ~ (41) 

^(i+vi°s( (A‘”“aid ) 

then after 6[log(e“^)] iterations of ITKsM, each on afresh batch of N training signals, the output dictionary 4' 
satisfies 

d(^, 4>) < £ (42) 


except with probability 


181og(e ^)iTexp 


200SK j ■ 


The proof is analogue to the one of Theorem 3.2 and can be found in Appendix |A.1[ 


Let us again discuss the result. The main difference to Theorem |3.2| is that the condition As > 2pS can 
only hold for much lower sparsity levels, that is S = 0{p~^) and thus for incoherent dictionaries up to 
the square root of the ambient dimension 0{Vd) <C 0{d/ log K). It is also no surprise that once the input 
dictionary is up to a log factor within this radius, ITKsM can achieve arbitrarily small errors. Indeed 
once As > 2pS thresholding is always guaranteed to recover the sparse support of a signal given the 
ground truth dictionary or a slight perturbation of it, ||42|. 

To again turn the corollary into something less technical and more interesting we combine it with the 
corresponding theorem. If the coefficients are strongly S'-sparse the minimally achievable error using 
Theorem 3.2 will be smaller than the error we need for Corollary 3.3 to take over and so we get the 
following O notation result. 


Corollary - O ( |3.3| . Assume that in each iteration the number of noiseless, exactly S-sparse training signals 
scales as 0{KlogKe~^). If S < 0{p~^) then with high probability for any starting dictionary T' within distance 
e < 0{1/y/logK) to the generating dictionary after 0(log(e“^)) iterations of ITKsM, each on a fresh batch of 
training signals, the distance of the output dictionary ^ to the generating dictionary will be smaller than e. 


While a convergence radius of around 1 /i/log K, admissible sparsity levels up to d/logK and a 
dependence of the sample complexity on only K log K is very positive, the dependence of the sample 
complexity on the squared inverse target error for noiseless exactly S-sparse signals is somewhat 
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disappointing. Again note that in the case of noisy signals information theoretic arguments indicate that 
this factor is unavoidable. 


Looking at the proof of Theorem 3.2 we see that the reason for this factor 
is the slow concentration of the sums x{In,k) around the atom (j)k- This can in turn be 

explained by the fact that via the summation we have to cancel out the equally sized contribution of 
all other atoms. Actively trying to cancel out these contributions already before the summation, that is 
summing residuals instead of signals, should therefore accelerate the concentration, and lead to a lower 
sample complexity in case of noiseless signals and better constants in case of noisy signals. We will 
concretise these ideas in the next section. 


4 Dictionary Learning via ITKrM 

There are several ways to remove the contribution of all atoms in the current support ^ except for 
The maybe most obvious way is to consider \k)yn = \k)]yn- Unfortunately this residual 

has several disadvantages, the most severe being that it is not clear whether for the oracle supports and 
oracle signs the corresponding sum of residuals concentrates around a multiple of the atom (j)k, 


IE ( ^ ^ Q{^I„\k)yn ■ (Tnik) ■ X{ln, k) j OC {Q{^I\k) 4>k) OC 


(43) 


We suspect that equality can only hold for tight dictionaries and that an additional constraint such as 
minimal incoherence is needed. We therefore choose a perhaps less obvious but more stable residual 
“n,fc(T') = y-n — ^)yn + P{'ipk)yn, which Captures the contribution of the current atom 4)k as well as 

the approximation error in 4', that is yn — P{^i^ )yn- Replacing the signal means in ITKsM with residual 
means we arrive at the new algorithm, iterative thresholding and K residual means (ITKrM). 

Algorithm 4.1 (ITKrM one iteration). Given an input dictionary T' and N training signals yn do: 

• For all n find = argmax/.|j |=5 ||4'^y„||i. 

• For all k calculate 


n 

Output ^ = ('0i/||Vli||2,...,^Jc/||Vlic||2)- 


Jyn + P{'fk)yn] ■ sign((V^fc,yn)) • x{li,n:k). 


(44) 


Again ITKrM inherits most computational properties of ITKsM. As such it can again be stopped after a 
fixed number of iterations or once a stopping criterion, such as the improvement below some threshold, 
is reached. Only one signal has to be processed at a time, making it suitable for an online version and 
parallelisation. Its computational complexity is slightly larger than for ITKsM because of the projections 
P('k/^ )yn- If computed with maximal numerical stability, these have an overall cost of 0{S‘^dN), which 
corresponds to the QR decompositions of 'Fjs. However, since the achievable precision in the learning 
is usually limited by the number of available training signals rather than the numerical precision, it is 
computationally more efficient to precompute the gram matrix 4'*'!' and calculate the projections via the 
eigenvalue decompositions of 4^^, 4'/=, which is less stable but reduces the overall cost to 0{S^N). Still 
for S > d?P these computations become the determining factor; we will see that S can again be of the 
order 0{p~^ / logK) 0{d/logK). In the next subsection we will analyse which convergence properties 
of ITKsM translate to ITKrM. 


4.1 Convergence Analysis of ITKrM 

As for ITKsM we focus on the more realistic case of non exactly S-sparse and/or relatively noisy signals 
and specialise our results to exactly S-sparse, noiseless signals and moreover the case where S < 0{p~^) 
later. 

Theorem 4.2. Let ^ be a unit norm frame with frame constants A < B and coherence p and assume that the 


training signals yn are generated according to the signal model in (131 with coefficients that are S-sparse with 




12 


absolute gap f3s and relative gap As- Assume further that S < and es := K exp (^— 4741^25 j < 48 ( 5 + 1 ) ■ 
Fix a target error e > with 


8iC2/BTl 

= —75T-exp 


Crll,S 

compare (19(, and assume that e < 1 — 72,5 + dp^. 
If for the input dictionary we have 

As 


-PI. 


98max{/i2jp2| J ’ 


(45) 


d{^, 4>) < 




and d{^, 4>) < 


32\/5’ 


(46) 


then after 12|'log(e ^)] iterations of ITKrM each on afresh batch of N training signals the output dictionary ^ 
satisfies 


except with probability 


60[log(e ^)]i6exp 


(i(§, 4>) < e 

-ChlsNe^ 


576K maxis', i? + 1} (e 4- 1 — 72,5 + dp'^) 


(47) 


(48) 


Proof: The proof follows the same two step procedure as the proof of Theorem |3.2} where in the first 
step we prove that one iteration will reduce the error by a factor k < 1 with high probability and then 
iterate this result. To prove the first step we again use a triangular inequality argument. So we check 
how often thresholding with T* fails. Assuming thresholding recovers the generating support we show 
that the difference between the oracle residual (based on the generating sign and support) using 4> and 
the oracle residual using 'k concentrates around its expectation, which is small. Finally we show that the 
sum of residuals using converges to a scaled version of fk- To keep the flow of the paper we do not 
give the full proof here but in Appendix |A.2[ □ 

Let us discuss the result. First we see that compared to the corresponding theorem for ITKsM we 
need somewhat more conditions. The first two extra conditions on the sparsity level S < and 
48{B + l)e 5 < 1 are technicalities. For all but the most ideal cases they are already implied by having 
a limiting error smaller than one. Since fds < 1/v^ the first condition is implied as soon as is 
larger than B/K, where at best we have pf = The second condition is a substitute for having small 
isometry constant of the generating dictionary Js < j and guarantees that most support sets of size S 
have (57(4>) <i It is implied by < 1 as soon as 13$ is smaller than or equivalently the dynamic 
range of the coefficients is larger than 7. 

The target error can again be chosen closer to the limiting error at the cost of horrible constants. Also 
note that the condition that the target error should be smaller than the expected squared approximation 
error and noise is again a technicality. If both noise and approximation error are so small that a larger 
target error makes sense we get the same result but with a smaller failure probability. To get an idea 
how such a result would look like we refer the reader to the corollary below, where we assume exactly 
sparse noiseless signals. 

The only extra condition that changes the quality of the result is the second condition on the convergence 
radius. Assuming that As = 0(1) the first bound in ( |4^ is of the order 0{\/^J\ogK), so as soon as 
S > logiT, meaning for most practically relevant cases, the second bound will be more restrictive. 
This decreased convergence radius of ITKrM compared to ITKsM is a little disappointing but seems 
unavoidable. The reason for this is that the expected difference between the oracle residuals using 4' and 
<F depends on the operator norms of the rescaled perturbation matrices ||i? 7 || 2 , 2 / compare Lemma B.8 If 
the perturbation dictionary is quasi constant, that is before normalisation Zk = v — P{(j)k)v for some v f 0, 
then 115/11 2,2 ~ VSe for all possible subsets I, so we need e < 1/Vs. 

The advantage over ITKrM is that for low expected noise levels and approximation errors, 1—^2,s+dp'^ <C 


1, we get better constants in the sample complexity. Actually from the probability bound in (481 we can 




















13 


already guess that for exactly sparse, noiseless signals we can reduce the factor in the exponent 
to Before specialising the theorem to noiseless signals we again provide a qualitative result, which 
combines the theorem above with the corresponding theorem for ITKsM in order to deal with the reduced 
convergence radius. That is we first exploit the large convergence radius of ITKsM and run ITKsM to 
arrive at an error 0{l/^/S). Then we exploit the lower sample complexity of ITKrM to arrive at the target 
error. 


Theorem - O (4.2 l Assume that in each iteration the number of training samples N scales as 0{K log Ke ^). 
If S < then with high probability for any starting dictionary Tf within distance e < 0{1/yJlogK) to the 

generating dictionary after 0(log(5)) iterations of ITKsM and 0(log(e“^)) iterations of ITKrM the distance of 
the output dictionary ^ to the generating dictionary will be smaller than 


max 




(49) 


Unfortunately the better constant in the sample complexity of ITKrM disappears in the O notation and 
we cannot really see the improvement over ITKsM. We therefore specialise again to noiseless, strongly 
S-sparse signals. 


Corollary 4.3. Let ^ be a unit norm frame with frame constants A < B and coherence p and assume that the 
training signals yn are generated according to the signal model in fi3) with r = 0 and coefficients that are exactly 
and strongly S-sparse with relative gap > 2pS. Fix a target precision e > 0. If for the input dictionary Tf we 
have d{'^, <1>) < and 

d(T',$)< --- ~ and (50) 

then after 9[log(e“^)] iterations of ITKrM, each on afresh batch of N training signals, the output dictionary ^ 
satisfies 

d(§, 4>) < e 


except with probability 


27i^|'log(e )] exp 


-tIsNs 


UAKmax{S,B} 

The proof sketch can be found in the Appendix |A.3[ 


(51) 


The above corollary clearly reveals the influence of the underlying signal model on dictionary learning 
results. So assuming that the signals are noiseless and exactly sparse and that S is only of the order 
0{p-^) = 0{Vd), we get that one iteration of ITKrM will reduce the error as long as the number of 
samples scales as 0{Ke~^), meaning the influence of the target error is reduced by a factor e~^\ 

Again combining with ITKsM and assuming that the stronger restriction on the convergence radius is 
the second bound in (^l, we get the following quantitative results. 


Corollary - O ( |4.3| . Assume that in each iteration the number of noiseless, exactly S-sparse training signals 
scales as 0{KlogKe~^). If S < 0{p~^) then with high probability for any starting dictionary 'k within distance 
e < 0{1/yJlogK) to the generating dictionary after 0(log(5)) iterations of ITKsM and 0(log(5“^)) iterations 
of ITKrM, each on a fresh batch of training signals, the distance of the output dictionary ^ to the generating 
dictionary will be smaller than e. 


Before a final discussion of our results we first illustrate our theoretical findings with some numerical 
simulations, which give interesting insights into the average convergance radius of the algorithms and 
indicate that in practice ITKrM can be a very powerful low complexity alternative to K-SVD. 
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5 Numerical Simulations 


To complement our theoretical findings, we conduct two small numerical experiments both on synthetic 
and real dataj^ First we test the average case convergence radius and speed of the ITKsM and ITKrM 
algorithm, by running both algorithms on noiseless and noisy training data, using three different types 
of initialisations with varying distance to the generating dictionary 

We generate our training signals based on the signal model in (13|. As generating dictionary <1> we 
choose the dictionary consisting of the Dirac basis and the first half of the elements of the discrete cosine 
transform basis in with d = 256, meaning K = 3* d/2 = 384, which has coherence ^ = \/2/d ^ 0.088. 
Given a sparsity level S, to simulate noiseless, exactly sparse signals we choose c with ci = ... = cs = 
1/V~S and Cfc = 0 for k > S, meaning dynamic range 1. To simulate noisy signals with a higher dynamic 
range we choose a decay parameter Cb uniformely at random in [0.9,1], and let the first S entries be 
a geometric sequence, that is Ck = bo * c\ for k < S and Cfc = 0 for k > S, where bo is a scaling 
parameter ensuring that ||c ||2 = 1. The noise r is chosen as a centered Gaussian with variance 1/d, that 
is r{k) rsj Af{0,l/Vd), resulting in an expected signal to noise ratio of 1. The three different types of 
initialisations are created by first choosing vectors Zk uniformly at random from the unit sphere in 
and then setting 

, _ , , Q{(/k)zk 

Wk — Oi ■ (pk + UJ ■ 11^, , ^ T]- 

WQwkjZkh 


for the ratios a : a; = 1 : 1 and a : a; = 1 : 4. We also consider the completely random initialisation 

V’fc = Zk- 

For each initialisation dictionary we then run 100 iterations of ITKsM and ITKrM with the true sparsity 
level and dictionary size as input parameters, each time using a new batch of 100000 noiseless, respectively 
noisy signals. Figure shows the average convergence respectively recovery rates over 20 trials for the 
three types of initialisations, using noiseless or noisy signals and for sparsity levels 5 = 4,8,12,16. 

For the 1 : 1 initialisations, despite the fact that the corresponding distance between the initialisations 
and the generating dictionary is much larger than the our estimated convergence radius, d('I', $) = 
\/2 — -v/2 PS 0.7654 » 1/yTog^, both algorithms always converge to the generating dictionary, so we 
plot the distance 4)) between the generating dictionary 4> and the output dictionary of the n-th 

iteration Figure j^a/b). As predicted by our theoretical results, using the same number of signals, 
ITKrM always leads to a more accurate estimate than ITKsM. As shown in Figure j^a) for the noiseless 
signals with dynamic range 1 this difference is quite pronounced and especially in the case 5 = 4 <^-1/2, 
the regime of unique sparsity, the precision of ITKrM is limited rather by the machine precision rather 
then the amount of training signals. From Figure [^b) we see that both algorithms are locally stable 
even for the comparatively low signal to noise ratio SNR = E(||r||2)/IE(||4>3;||2) = 1 and coefficients with 
dynamic ranges varying between 1 and 0.9^“'^. 

For the 1 : 4 initialisations, corresponding to distance d('I',<I>) = \J2, — 2/y/VI k, 1.2308 between the 
initialisations and the generating dictionary, we do not always have convergence to the generating 
dictionary. We therefore plot the percentage of atoms recovered by the algorithm, using the convention 
that an atom c/k is recovered if max^ \{'ip^'^\<t>k)\ > 0.99, compare l3l. Gounterintuitively to our theoretical 
results ITKrM turns out to be much more stable to far away initialisations. As we can see from Figure j^c), 
in the case of noiseless signals ITKrM always recovers more than 99% of the atoms, while the recovery 
rate of ITKsM deteriorates quite drastically as the sparsity parameter S increases. To be more precise 
after 100 iterations ITKrM recovers the full dictionary for 17,17,15 and 8 out of 20 initialisations for S 
taking values 4,8,12 and 16 respectively, while ITKsM can only recover the full dictionary in case 5 = 4, 
(15 out of 20 initialisations), and for all other sparsity levels fails every time. The better performance of 
ITKrM is further confirmed by the results for noisy signals shown in Figure j^d). While the recovery 
rates of ITKsM deteriorate further and even for 5 = 4 ITKsM can never recover the full dictionary, 
ITKrM continues to perform well. Indeed for ITKrM we can report a dithering effect, that is a better 


1. A Matlab Swiss knife (mini-toolbox) for playing with ITKrM and reproducing the experiments can be found at http: 
/ /homepage.uibk.ac.at/~c7021041/ITKrM.zip 
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(a) 


(b) 




(c) (d) 




(e) (f) 

Fig. 1. Convergence respectively recovery rates of ITKsM and ITKrM for three initialisation types, 
corresponding to increasing distance to the generating dictionary, and using training signals with varying 
sparsity levels both in the noiseless and noisy case. 
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performance in noisy conditions, as ITKrM recovers the full dictionary 18 out of 20 times for S' = 4 and 
always recovers the full dictionary for the other sparsity levels. 

For the random initialisations we again plot the recovery rates, which confirm the trends observed for 
the 1 : 4 initialisations. Figure j^e/f). While the recovery rates of ITKsM are at best around 73% in the 
noiseless case for 5 = 4 decreasing to around 35% in the noisy case for 5 = 16, ITKrM always manages 
to recover at least 93% of the atoms. Interestingly even though again recovery speed decreases as 5 
increases, both in the case of noiseless and noisy signals the recovery rates increase with 5. So in the 
case of noiseless signals for 5 = 12 and 5 = 16 we again get a more than 99% recovery rate and can 
even report one respectively two full recoveries. In the case of noisy signals the dithering effect is now 
clearly visible and remarkably in case 5 = 8 ITKrM can recover the full dictionary 17 out of 20 times 
and for 5 = 12,16 we always get full recovery. 

Finally we also conduct a small experiment on image data to show that the more promising ITKrM 
algorithm is not merely a pretty toy for synthetic set-ups but indeed useful in practice. In particular 
for two 256 x 256 images, Fabio and Barbara, we take all 62001 possible 8x8 patches, normalise them 
and afterwards subtract their mean, that is we project the patches onto the orthogonal complement of 
the constant atom (pi = 1/8. On these patches we then learn a dictionary of 63 atoms, corresponding to 
the dimensionality of the signals after subtracting the mean (d=K=63). To be precise, we use a random 
initialisation, set the sparsity level 5 = 5, and in each of the 100 iterations use 10000 randomly selected 
patches. Figure]^ shows the two images together with their respective learned dictionaries (including the 
constant atom <pi). 

As we can see ITKrM is able to calculate meaningful dictionaries also on real data. In particular observe 
that even though we have used the same initialisation the dictionary learned on Barbara contains a lot 
more high frequency wave-like atoms, which capture the texture of the scarf. For the sake of conciseness 
we do not go into more details about the approximation performance of the learned dictionaries or 
possible image processing applications here but refer the interested reader to Il23l . ll4TI . Il36l . Instead we 
now turn to a final discussion of our results. 

6 Discussion 

We have shown that iterative thresholding and K-means is a very attractive dictionary learning method, 
since it has low computational complexity, 0{dKN) omitting log factors, can be used in parallel or 
online, has convergence radius 0{1/^J\ogK) and sample complexity 0{K log K£~‘^) for a target error e, 
which reduces to 0{K log K£~^) in the case of noiseless exactly sparse signals. Further to the best of our 
knowledge it is the only algorithm for learning overcomplete dictionaries, that is proven to be (locally) 
stable for sparsity ranges up to a log factor of the ambient dimension - that is recovery down to a target 
error K~^ for sparsity levels 5 up to 0(p“^/(£logAT)) = 0{d/{^logK). 

As such it improves on related results in terms of computational efficiency, convergence radius and 
admissible sparsity level, |[I], or in terms of achievable error and admissible sparsity level, ISJ. In the 
case of noiseless signals, which is the only valid regime for |{T1, the sample complexity is in comparison 
larger by a factor £~^. However, note that there are information theoretic results indicating that in the 
case of noisy signals the dependence of the sample complexity on the inverse squared target error 
is optimal, ESlI . For an overview of results for iterative dictionary learning algorithms see Table 1. For a 
more general overview of theoretical results in dictionary learning see Table 1 in Kl9]l . 

Further we have shown that in synthetic experiments the computationally more involved algorithm 
ITKrM often converges globally, when initalized with a random dictionary. This together with the fact 
that the algorithm is also able to learn meaningful dictionaries on image data makes it an attractive low 
complexity alternative to K-SVD. 

The global convergence behaviour of ITKrM comes partly as a surprise since for ITKrM we can only prove 
a convergence radius of the order 0{l/y/S) as opposed to 0{\/^logK) for ITKsM. It also indicates that 
one might be able to increase the convergence radius of the algorithms by making additional assumptions 
on the perturbation dictionary, that is the normalised difference between the input and the generating 
dictionary, such as good conditioning and incoherence like the random perturbations in our experiments. 




17 



(c) (d) 

Fig. 2. 256 X 256 images together with the dictionaries learned on their 8 x8 patches, (a,c) Fabio, (b,d) 
Barbara. 

All that then remains to show is that most perturbations have this additional property and that one 
iteration of ITKrM conserves the property respectively that an additional small corrective step can restore 
the property. 

For the theoretical results, another slight disappointment hidden in the O notation is, that both the 
convergence radius and implicitly also the limiting precision decrease with the dynamic range of the 
coefficients. This seems unavoidable since the success of thresholding depends on the dynamic range. So 
while we could improve our results to depend on an average dynamic range instead of the worst dynamic 
range by assuming a probability distribution on the dynamic range in our proofs, this average dynamic 
range will remain a limitation. To remove the dependence on the dynamic range we would have to replace 
thresholding by another sparse approximation method such as (Orthogonal) Matching Pursuit or Basis 
Pursuit, which is used in |T|. However, the only method that is known to be on average stable for sparsity 
levels S > Vd is Basis Pursuit, |146], and it will need some work to extend the corresponding results 
to perturbed dictionaries, noise and approximation error all at the same time. A maybe less daunting 
strategy is to extend the stability results for thresholding to iterative (hard) thresholding methods, [9], 
IjlOl , IIT6l . Il24] . Another strategy to overcome large dynamic ranges, we are interested in, which would 
at the same time remove the requirement of knowing the exact sparsity level, is to extend our results to 
the case where we can only assure a gap between cs and cs+t for T > 1. 

The most important research directions are concerned with the globality of the results. To get to an 
efficient algorithm we need to find initialisation strategies, such as in 0, that remain cost efficient also 
for sparsity levels S = /{£logK)). An alternative strategy, we are currently pursuing, is based on 
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‘valid for Algorithm 2. Algorithm 5 seems to achieve similar errors as ITKsM/ITKrM but at significantly higher 
computational cost. Further the dependence of its sample complexity on the target error is not made explicit. 

TABLE 1 

Comparison of theoretical results for iterative dictionary learning algorithms. 


the earlier mentioned additional assumptions. If the perturbation dictionary not only has a flat spectrum 
but is itself incoherent and incoherent to the generating dictionary we expect one step of ITKrM to reduce 
the perturbation sizes but to keep the perturbation directions roughly the same. Estimating the volume 
of 'good' perturbations we could then calculate the probability that a random initialisation is successful 
or, in case this probability is too small, add a corrective step that restores the good properties of the 
current iterate. 
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Appendix A 

Proof Sketches 

A.1 Proof of Corollary [3^ 

The proof is analogue to the one of Theorem |3.2[ We only need to take into account that without noise we 
have Cr = 1 and that in all estimates the constant B + 1 can be replaced by B, since for noise free signals 
Un = we have \\yn \\2 < B. Further since the coefficients are strongly 5-sparse, thresholding 

using the generating dictionary <1> will always (almost surely) recover the generating support with a 
margin Us > {As - 2fj.S)cn{l), that is min^g/^ \{4'k,yn)\ > K</>fc,?/n)| + Us, compare 1^. Therefore 

the event that thresholding using 'F fails or that the empirical signs differ from the generating ones is 
contained in 


and we get 


T"" := 


|yn : 3A: s.t. Uk 


^ 0-n{j)Cn{Pn{j)) {4>j, Zk) 
3 






1,5 


4^k 


< • Un G ^n} + 


1 

N 


'^ynO-n{k)x{In,k) 

n 


71,5 

K 



(52) 


(53) 


which can be estimated as before. 


A.2 Proof of Theorem IQ] 

As already mentioned we use the same two step procedure and ideas as in the proof of Theorem (3.21. 
Step 1: We first check how often thresholding with 'F fails. Assuming thresholding recovers the generating 
support we show that the difference of the residuals using $ or 'F concentrates around its expectation, 
which is small. Finally we show that the sum of residuals using <F converges to a scaled version of (pk- 
To make the ideas precise we define the thresholding residual based on T' 

R\^,yn,k) := [yn - P{^Jyn + P{i^k)yn] • sign((V’fc,yn)) • 7:(4,n>^) (54) 

and the oracle residual based on the generating support In = p“^(S), the generating signs an and 'F. 

yn, k) := [y„ - P{^!i^)yn + P(V’fc)yn] • crn{h) ■ x(/n, k). (55) 

We can now write. 






1 

N 


Y, yn, k) - yn, ^)] + ^ E Vn, k) - R°{^, yn, k)] 

n n 

+ ^ • O'n(fe) • X{ln, + 

n \ n / 

Abbreviating Sk = Y^niVn, 4>k) ■ crn{k) ■ xiP, k) we get 
Skft^kh < ^1 X [R\^^yn,k) - R°{^,yn,k)\ ^ 

n 

^\Y[R°{^.yn,k)-R°{^,yn,k)\\^ 

+ ^11 X • 0'n(A;) • x(/n, fc) (57) 


+ ivl 
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We first estimate the norm of the first sum using the fact that the operator + P{'4’k) is an 

orthogonal projection and that \\yn II 2 < ^/B + 1, 

2/BTT 


“ B°{^^Vn,k)] 


< 


N 


tj{n : Vn, k) / R°{^, A:)}. 


(58) 


Next note that on the draw of yn the event that the thresholding residual using 'I' is different from the 
oracle residual using {y„ : ,yn,k) / A:)} for any k is again contained in the events 

£n U Pn as defined in ^7 )/( [^ , 

{yn ■■ R^{^, yn, k) / R°{'^, yn, A:)} C {y„ : „ / /„} U [yn : sign(^'j^yn) 7 ^ cFn{In)} C <5n U Pn- (59) 

Substituting the corresponding bounds into (|^ we get. 


X II ^ 2^B + 1 ^ c 1 I ‘2‘^B + l ^ T7 1 

- SkCpkh < --BW : Vn G £n\ H--tlw : Vn G Pn\ 

+ ^\\^[B°{^,yn,k) - R°{<^,yn,k)] ^ + ^^\^[yn - P{^ljyn] ■ (^nik) ■ xiln,k) (60) 


For the first two terms on the right hand side we use the same estimates as in the proof of Theorem 3.2 


To estimate the remaining two terms on the right hand side as well as Sk we use the corresponding 

NPrP'y'^ \ 


lemmata in the appendix. From Lemma B .6 we know that 


h X ^)^n{k){yn, 4>k) 


N 

n 

From Lemma 

... / 1 


< ( 1 -io) 


CrXly 

K 


< exp — 


2K{1 + ^ + 5p2 + toCrXi,sVBTi/3) 


(61) 


B .8 


we get that if 5 < min{^, g^}, e < and < 


32 VS 


N 


^[R’^{^,yn,k)-R%<^,yn,k)] 

n 

< exp — 


> ^^!^(0.381e + f3) 




40iTmax{5,S + l} 


mm 


24(B+1) 


As 


then 


e2 + £5 (1 - 72 ,s + dp 2 ) / 160 ’ 3 


>7 +7 


(62) 


Finally from Lemma |B.7| we know that for 0 < ^4 < 1 — 72,5 + dp^, we have 


( X - P{^p)yn] ■ CTnik) ■ x(4, k) 

\ n 




< exp — 


K 

tlChlsN 


8K maxis', + 1} (1 - 72,5 + dp"^) 




(63) 


Thus with high probability we have 


\'4^k 'Sfc*^fc||2 — 


Crp,S 


(e/^jP + Ai + T£ + f 2 + 0.38l£ + As + A 4 ) and 7 (1 — to) 


Cr7l,S 


^ ^ . (64) 

To be more precise, if we choose a target precision e > Se^^p and set fi = 5/24, A 2 = A 3 = max{e, e}/24, 
r = 1/24, t 4 = e /8 and to = 1/50 we get 


max 

k 


7 Cr'yi,S , 
Wk - 


/nc ^rll,S r~ T , . ^ n no P'rll,S 

<0.8-———max{e,e| and mmsfe>0.98-———. 
K k K 


(65) 


except with probability 

-Crll,sN£ 


exp 


+ 2 iT exp 


‘imK^/WTi 

-ChlsNe^ 


+ exp 


512/6maxis', + 1} (1 — 72,5 + dp"^) 


~Cr 7 i,sN max|e, e} 
144i6\/F+t 

+ 2K exp 


+ K exp 


-ChlsN 


16(5103 + 34 CrXxs\fBTi) 

—C/72gNmax|e, e}2 \ 


576/6 max|5, /? + 1} (e + 1 — 72,5 + dp^) j 
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Note that in case the target precision e is larger than es, as happens for instance as soon as Ps < 
and therefore > es, the last term in the sum above reduces to 

gN max{e, e} \ 


^ y 576itr max{5, B + 1} {2 — j 2 ,s + dp'^) J 
Lemma B.IO then again implies that 

d(^, < 1 >) = max 




- (I>k 


< 0.92 max{e, e}. 


( 66 ) 


(67) 


Step 2: The second step is analogue to the one in the proof of Theorem 3.2 


A.3 Proof of Theorem [431 

We follow the proof of Theorem |4.2| but take into account that in case of exactly S'-sparse, noiseless signals 
the bound reduces to 


- SkCpkh < • Un ■■ Vn G K} + ^11 XI “ R°{^^yn,k)\ 


N 


( 68 ) 


Since the relative gap A 5 > 2pS we get 63 < pS <\ and by Lemma 
IP : yn G 2Fn} > + * 2 )! < exp 


B.4 


2 KVb 




2K\/B (2re + t 2 )J 


whenever 


e < 


As — 2pS 




ST 


Further by Lemma B .8 

f 1 


N 


Y,[R°{^,yn,k)-R°{<^>,yn,k)] 


tsllsN 




(69) 


(70) 


and again by |B. 6 | 
1 


N 


'^x{In, k)an{k){ynAk) 


^ + ^CrXl,s\ , 

< (1 - to) —— I < exp 


Nth! 


K 


min 1 j> + I I , 


• (71) 


2K{1 + p^{S - 1) + to7i,5\/^/3) 


Thus with high probability we have 


llV'fc - Sfc</>fc ||2 < ^ (7'e + i 2 + 0.611e + f 3 ) and Sk > [I - to)'^^. 

The final result follows as before from setting to = 1/50, r = 1/24, t 2 = max{e, e}/24 and is = 2 t 2 - 


(72) 


Appendix B 

Probability Estimates & Technicalities 

Theorem B.l (Vector Bernstein, E5I , [23, [291 1. Let {vn)n G fl finite sequence of independent random 
vectors. i/||fn ||2 < M almost surely, ||E(i;,i )||2 < mi and X]nII^(ll^^^ll 2 ) — then for all 0 <t< m 2 /(M + mi) 


P 



n 2 



< exp 


t^ i\ 
8 m 2 4 / 


(73) 
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and in general 




t 


> t < exp I -• min < -, 


8 


t 


1 


1712 M + mi 


1 

+ 4 


(74) 


Note that the general statement is simply a consequence of the first part, since for t > m 2 /{M + mi) 
we can choose m 2 = t{M + mi). 

For the simple case of random variables we also state a scalar version of Bernstein's inequality leading 
to better constants. 

Theorem B.2 (Scalar Bernstein, |(S|). Let Vn G M, n = I... N be a finite sequence of independent random 
variables, if E(ti^) < m and Edu^l^) < ^k\mM^~'^ for all k > 2 then for all t > 0 

^ ^ E(u„) >t] < exp ( - 


2{Nm + Mt) ^ 

Lemma B.3. For yn following model with coefficients that have an absolute gap fts we have, 


IP ■■ Vn G £n} > ^ ( 


/ -t‘^Crji,sN 


2K^/ B + 1 {2e^^p Ft)) 


(75) 


where £up= 

UtP C,7i,s 


exp 


-hi 

98max{/i^,p^} J' 


Proof: We apply Theorem B.2 to the sum of indicator functions to get 


IP f tt{^ : yn G £n] > ^E(£’n) + tN j < exp 


,F{Sn) + tNj • 


(76) 


To estimate F{£n) we apply Hoeffding's inequality to ( |27| resp. use the subgaussian property of r„. 
Omitting subscripts for simplicity and abbreviating u = c{S) — c{S + 1) we get, 

E(-5) < ^ I - 7 ) 


j¥=k 


< 




u 


exp 


.98 Ejyfcc(p(j))^|((/)j,4)d 


+ 2K exp 


—u 


98p^ 


< 2K exp 

< 4K exp 


-^s 


zM 

98p^ 
Crll,S 


98max{p2^ p2|y 2Ky/B~+l 
The result follows from the substitution t —)■ 

Lemma B.4. (a) For yn following model with coefficients that have a relative gap As we have, 

™^ur Crll,sN , ^ ( -t^Crll,sN \ 

V ^ - 2K^/B£^ ^ 7- ^ V2K7lBTl(2re + t)y ’ 


(77) 


□ 


(78) 


whenever 


e < 


As 


^ i + ^g) a‘cE,7’ ) 


(79) 
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(b) For Un following model (131 with coefficients that have a relative gap As > 2pS we have, 


P ( ‘^{n : Un G _ • (re + t) \ < exp ( _ ^ 

in yn - 2 kVB ^ 7“ \2K^/B {2Te+ t)) 


whenever 


e < 


As — 2pS 


( i + A/log 


19K'^B 

(As-2/iS)7i,sr 


(80) 


(81) 


Proof: We apply Theorem B.2 to the sum of indicator functions Ijp(s) to get 
P : yn G -^77 > ^ P(4')) + < exp 




(82) 


To estimate P(4*^) we again apply Hoeffdmg's inequality this time to (|^/ ( 52 1 resp. use the subgaussian 
property of r„. Omitting subscripts and using the short hand u = c{S) — c{S +1) and Ug = {As — 2pS)c{l) 
we get. 


P(J^) < E«’h’'=lE (^{j)c{p{j)){4>j,Zk) 


j¥=k 


^ u e^c(5) 


E I , . I u s'^ciSi 

P fu;fc|(r,Zfc)| > —- — 

U \ 


< 2 exp 

k 


- 


' 1 

+ 2K exp 

f- 


1 

^ 984 E j/fc c{p{j))^\{fj,Zk)f ^ 

[ 

4 • 98p2 

/ 


< 2K exp 

< bK exp 


V 



' 1 

+ 2K exp 

(- 


7 

min{c(l)2il, 1} 

4•98e2p2 1 


/ 


V 


-{c{S)-c{S + l)f\ -Al\ 


< 5K exp 


V98e25 ) 


98e2c(l)2S 

From Lemma A.3 in Il39l we further know that condition ( [7^ implies 

^\98e‘^B) - 2iT7F+l ’ 

and the result in (a) follows again from the substitution t —)■ 2 k"Jt^ 
Similarly we get 


(83) 


(84) 


P(-E^) < ^k\'Y^(y{j)c{p{j)){fj,zk) 




^ ih _ e^c{S ) 
“2 4 


< 2K exp 


f- 

((A5 - 2pS)cil) - 

1 

< 3K exp 

f-iAs-2pS)^\ 

^ 71,5 

V 

8e2 min{c(l)2il, 1} 


^ 8e^B ) 

2Ky/B 


• re, (85) 


7i,s 


whenever ( |8T) holds and the result in (b) follows from the substitution t 2 kVb 
Finally note that another (messier) way to bound Yl,j^k^{p{j))^\{f 3 Fk)f is 


t. 


c{p{j)f\{fj, Zk)f < min{c(l)74'/||i,2 + 1 “ 72,s, c(l)7^/ll2,2 + c{S + ifB], 

j¥=k 


( 86 ) 


However, in the case of exactly S-sparse signals these can lead to better (and again clean) estimates, such 
as c(l)^(l + pS) or c(l)^(l + 6s) if 4> has isometry constant < 1. □ 
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Remark B.l. The last two lemmata are used to prove that, once the perturbed dictionary T' is within 
radius 0{1/\og{K)) of the generating dictionary thresholding will always succeed in recovering the 
full generating support, even for S = 0(/i“^). Without assuming random signs, we can still get that 
thresholding recovers the generating support once 'I' is within radius 0{l/^/S) for reduced sparsity 
levels S = 0(/i“^). 


Lemma B.5. For y-n = 


Vi+lh"ll2 ’—’ Vb+2 


we have 




N 


\/l + 


F rn C'p7i S' 

■ ^ri{k) ■ x(/n, k) - 


' ri\\2 


Proof: We apply Theorem 


B.l 


K 


, ^Xc V 

to Vn = 


> 


Cr 7 l,S 

K 


t^ChlsN 1 


t < exp — 


8SK 


+ 


(87) 






<^n{k) ■ x{In,k). Since the Vn are identically 


distributed we drop the index n tor our estimates. Remembering that I = p ^ (S) we get, 

E{v) = Ec,p,a,r ( '^ic(p(j))(T(j) • a{k) + r • a{k) 


= E, 


x(S,p{k)) • c{p{k)) 


c,p,r 


+ 


<t>k 


= Ep 


\/l+^ 


Ep 


c(l) + ... + c(5) 
K 


Crll,S 


K 


( 88 ) 


and ||E(r ;)||2 < VSjK. Together with the estimates, 

E (lIHli) = E r) + ||r||2)^ = E (x(/, k)) = | 

II II ^ ll^3;c,p,tT + J’lb ^VB+\\r\\2 

this leads to 


\v\\2 < 




< 


-x/i + 


< y/ B + 1, 


-E 

N ^ 




c„,p„,o-„ + ,, s CrXl,S , 

■ (^n{k) ■ X{ln, k) - 


7/1 + 


' n\\2 


K 


, t'^KN 1 
> t \ < exp-^ 7 :;-h - 


8S 


(89) 


for 0 < f < 


iC(v/5+T+f) 


. The final statements follows from the substitution t 


a7i 


— t and simplifications. 

□ 


Remark B.2. Note that for Eq. (881 in the above proof, we have used the sign invariance in our model 
but not the permutation invariance. For very small sparsity levels we can also get a stable version of the 
lemma using only the permutation invariance. Assume for simplicity that <1> is an orthonormal basis and 
that the sparse coefficients are constant, = c for k < S and zero else. In this worst case scenario where 
the signs never cancel out we get 


E(ti) = c 



5-1 

d-1 



and 


||E(u )||2 = c 


1 + 


d-1 


(90) 


which implies that the atoms can be learned up to a precision 0{S‘^/d). A relaxed condition replacing sign 
and permutation invariance could be that the coefficient sequences x satisfy E (x(j) sign(x(/c))|A: G /) <C 
E(|x(A:)||fe G I) for I containing the indices of the 5 largest coordinates in absolute value, that is 
minjg/ |x(z)| > max^y/ k(j)l- This condition is quite natural as it basically prevents two atoms fk and 
cj)j from always appearing together in the same ratio x{k) : x(j) = a : b. In this case they could simply 
be replaced by two copies of the same atom, fj = fk = afk + bfj which would increase the response 
criterion on which ITKsM is based, see ||40l. 
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Lemma B.6. For yn = ™ model we have 


N 


'^x{In,k)an{k){yn,(t)k) 


< (1-i) 


Cr7l,S 


K 


< exp — 


Nt^Chls 


2K{1 + ^ + Sp2 + tCrii,sy/BTi/^) y ■ 


(91) 


Proof: We apply Theorem B.2 to Vn = xikn,k)an{k){yn,(pk), as usual dropping the index n in the 
estimates for conciseness. For the expectation we get 


— ^c,p,cr,r 


Xil,k) 

\/l + \\r\ 


'^c{p{j))cr{j){(pj,(j)k) ■ cr{k) + {r, fk) ■ cr{k) 


= E, 


c,p,r 


x(S,p(A;)) • c{p{k)) '\ _ Cr-fi,s 
Vl + llrl|2 } ~ K ■ 




We further estimate the second moment m as 


E(?;2) =Kc,p,a,T j2 {Y^c{p{j))a{j) {fj , fk) + (r 

< ^c,p ( X(k,k) ■ I J^c(p(J))^l(fj,fk}l^+K {\{r,(/)k)\ 


(92) 


<E,,J x(/,A:)- I ^ + |(</>i,4)P + p' 


s fX2,S , B 


< — 




+ 17+P^ 


K \ S K 


(93) 


In the case of exactly 5-sparse signals, where 72,5 = 1 we get the alternative bound, E (ti^) < _|_ ( 5 " _ 

l)p? + Sp2). Since |?;| < \{y.,4>k)\ < llylb < \/B + l we can choose M = □ 


Lemma B,7, For yn = 






as in model 


Pi^ljyn) ■ (Tnik) ' xi^n, k) 


- K 


< exp — 




max 


8 iT max{S, B + 1} 11 ~ 72 ,s + dp 


1 + 


(94) 


Proof: We apply Theorem B.l to Vn = {yn — P{^if)yn) • (rn{k) ■ x{dn, k). For brevity we again drop the 
index n in the estimates and define the orthogonal projection Q{^i) = Id — P{^i). For the expectation 
we get 

E{v) = 'Ec,p,a,r ( <3(4>/) ( ^ fjcipiJ))a{j) ■ a{k) + r • a{k) 


= E 


c,p,r 


X{l,k) 

v/1 + M 


c{p{k))Q{<^i)4>k = 0, 


(95) 
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and for the second moment 


lE(lklli) = 


X{l,k) 

1 + Ikll^ 




) — ^c,p,a,r ^ 

<E,,p ixihk)- [^c{p{j)f\\Q{^i)^^\\l + W.r 


< ^c,p k) ■ c{p{j)Y + min{l, (d - 

Since v is bounded, 

\\Q{<^l){^Xc,p,a + r)\\2 ^ V^(l - 72 ,S,min) + ll^lb 


< ^ • (l - 72,s + dp'^) . 


Iplb < 

we get for t 


v'l + lklli 

Ct-7i.s f 


\/l + 


— V ~ 72,S,min) + 1 < VB + 1, 


(96) 


(97) 


K 


Pi^lJVn) ■ CTnik) ■ xiP, k) 




K 


. , tCrXxsN 

< exp (-—— max 

^ ( tChlsN 

< exp I- — -max 


tCrXxS 


1 


S{1- 72,s + dp"^) ’ + 1 

t 1 


1 

+ 4 


-S'(l - 72,s + dp"^) ’ CrXi,sVB + 1 




The final bound follows from the fact that Cr < 1 and 71^5 < VS- 


(98) 


□ 


Remark B.3. Note that for the abov lemma neither the sign nor the permutation invariance are crucial. 
Without both assumptions we could still get a stable version of the lemma because we can bound E(ti) 
by the residual energy ||?/n — T’(‘l*/„?/n|| 2 / which should be small if the signals are assumed to be S'-sparse. 
To get perfect recoverability E(r;) we could make the natural assumption that in expectation the residuals 
a-n = Vn — P{^i„)yn = Q{^i^)^Xn are uncorrelated with the sign of the k-th coefficient Xn{k.) whenever 
A; G , E (o„ sign(x(/c))x(/„, k)) = 0. Indeed if this is not the case it means that the signals can be even 
better sparsely approximated if the atom cjik is distorted towards this signed residual mean. 

follows the random model in ( |T^ . Assume S < min{g^, gg^} 


Lemma B.8. Assume that pn = 
and d{^,^)=e<^. 

(a) If es := Kexp ^ 




< 


1 


4741/^=5 J — 48(S+1) 


we have 


N 


^[R'^i^,yn,k)-R%<I>,yn,k)] 


> (0.381e + t) 


K 


, , tCrJxsN . 

< exp (- — -mm 


AC'r 71,S 


S [5e2 + £5 (1 - 72 ,s + dp 2 ) /32] ’ 3^5 + 1 


1 

+ 4 


(99) 


(b) If X 2 ,s = 1) P = 0 together with es < or ds(<i>) <1/4 this reduces 


to 


N 


Y,[R%'^,yn,k)-R°{<^,yn,k)] 


> (0.3816 + t) 


K 


< exp — 




32eK max{5, B} 


t 


min ■( -, 1 7 + 7 


1 


( 100 ) 
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(c) If 72,5 = l,p = 0 and 6si^) < 1/2 we have 
„ / 1 


N 


n 

< exp — 


> ^(0.611e + l)^ 




32eK max{5, B} 


t 


min < -, 1 > + - 


1 


( 101 ) 


Proof: We apply Theorem B.l to Vn = R°{'^,yn,k) — R°{^,yn,k). Again we drop the index n in the 
estimates. Remembering the definition of R°{'^,yn,k) in we first expand v as 


( 102 ) 


V = {yn- P{'I>lJyn + P{'fk)yn) ■ (^n{k) ■ x{In, k) - [yn “ P{^I^)yn + P{(l>k)yn) ' (^nik) ' x{In, k) 
= [P{^i) - P{^i) - P{fk) + PiA)] y • cT{k) • x{i, k). 

Abbreviate T{I, k) := Pf^i) — P{'^i) — Pifk) + Pi'fk)- Taking the expectation we get 

E{v) = Ec,p,a,r ( XI 


= E, 


c^p^r 


II' 112 

x{I, k) • c(p(fc)) 

, \/l + 11^112 


[P{^l) - P{^l) - P{fk) + P{^k)] fk 


CrlXS (K - 1 

K [s-l 


-1 


[P{^Pk)-Pi'I'I)]fk. 


(103) 

|/|=S,fce/ 

We next split the sum above into a sum over the well-conditioned subsets, where (i/(<l') < Sq, and the 
ill-conditioned subsets, 5/(<l>) > 6o, 


E(r;) = 


/A-l\ 

-i/ 

U-iJ 

V 


[p(V,,)-P(vI/,)]4+ [P{i;k)-P{^l)]cbk 

\ 1/1=5,fee/ 

\ <5(«>/)<5n 


l/|=s,fce/ 

^{^/)>^o 


(104) 


We further expand the sum over the well-conditioned sets using Sublemma B.9 

[P(V'fc)-P('k,)]4= X iPi^i)bk + yi,k) 


|/|=5,fee/ 

<5{«>j)<50 


|/|=s,fce/ 

5(^j)<Sq 


{^i^*ibk + [Pi^i) - bk + VI,k) 


|/|=s,fce/ 

5(^/)<5o 


Y ^I^*ibk- X X i[Ri^i)-^i'^*i]bk + Vi,k) 

\I\=S,kGl 


|/|=5,fce/ 

5(<i>/)>50 


l/|=s,fce/ 

5(^/)<^0 


K -2 
S-2 


Y X i[P{^i)-^i*^1]bk + Vi,k), 


(105) 


|/|=s,fce/ 

S(^j)>Sq 


|/|=5,fee/ 

S{^i)<Sq 


where for the last equality we have used that {bk, 4>k) = 0. Substituting the last expression into ( 104| | we 

geh 


E{v) = 


Cr'Vl,S 

K 


S-l 

K-1 


+ 


A- 1 
5-1 


-1 


i[Pi<l>i)-<l>i^*j]bk + vi,k) 


|/|=5.fce/ 

5(^j)<5o 


-t 


A- 1 
5-1 


-1 


{[P{i;k)-P{'I'i)]fk-<I>i^*ibk) 


|/[=5,fee/ 

<5(4>j)><5o 


(106) 
























28 


Substituting the bound ||P(<b/) — < <5(<b/) < (Jq as well as the bound for ||?//,a :||2 from Sub¬ 

lemma IB.9I for the well-conditioned subsets and the bound 


II [P(V'fc) - P{^i)](t)k\\2 = \\P{^i)Q{'iPk)4>kh < \\Q{'^k)4>k\\2 = V^i - KV’fc,</>fc)p < £k 

for the ill-conditioned subsets finally leads to 


(107) 


P(^)l| 2 < 


CrJl,S 

K 


S-l 
K -1 


11 112 + <^011 112 + 


2ey/S 


^[l-6o){l-^i)-2eVS 


Wbkh + £\\bkh 


+ >5o:\I\ = S,kGl)- (Ek + B\\bkh) 


< 


C*r7l,5 

K 


SB ^ 

— + 6o + e + 


2eVS 


{l-6o){l-^4)-2eVS 


+ {B + 1)P(<5(^>/) > 5o : |/| = A: G I) 


If (^5 < kr we choose 5o = Ss, which for S < and e < leads to 


Mv)h < 0-611e 


32V5 

Cr'yi,S 

K ' 


Whh- 

(108) 

(109) 


In the non-trivial case, where $ does not have a uniform isometry constant (5s < ^, we can estimate 
( 108| l using J. Tropp's results on the conditioning of random subdictionaries. Reformulating Theorem 12 
in Il46l for our purposes we get that 


P(5($/) > ,5o : |/| =S)<e- 


for 


s = 


14V5 


( 110 ) 


whenever e s > log(5/2 -|-1) and 5 > 4. Together with the union bound. 


Fi5i<^i)>6o:\I\ = S,k€l) = 


< 


K -1 
S-l 
K-1 
S-l 


-1 


^{I :5{<^i)>6o,\I\ = S,k€l} 


K 


«{/ : (5($/) > (5o, \I\ = S} = -- P((5(c^ 7) > (5o : \I\ = S), (111) 


this leads to 


P(( 5 (T>/) > < 5 o : |/| = 5 ',fe G /) < max<j 5 , —| exp (- 144^25^ - 


( 112 ) 


whenever e“^/^(5o > - in case one of the other original conditions is violated the statement is trivially 

true. Using the assumption S < g^, which does not represent a hard additional constraint, considering 
that in order to have < 1 we need S < g^ and that ^ ~ we get for 5q = 

P ^5(T>/) >^:\I\= S,kG < iTexp — ^s, (113) 

Substituting this bound for the choice (5o = j into ( 108| l and using that e < and es < 4 g(g_)_i) we get 

Cr'yi,S 


||P(u)||2 < 0.381e 


K 


(114) 
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The second quantity we need to bound is the expected energy oi v = T{I, k)y ■ a{k) ■ x(/, k), 
lE(lklli) =IEc,p,<7,r • T{I,k)(^Y^cj)jc{p{j))a{j)+ r'j 

( E c{p{j)f\\T{I, kYjWl + \\T{I, k)r\\l 


= K, 


c,p,r 




p,r 




< 


Ep[x{I,k)[^^\\T{IY)<Pj\\l + 


j&i j^i 

2,1- 72,S 


K-S 
i 6 ^ 


j2\\ni,k)Y\\i+Er {wniYHi] 


(115) 


We first estimate the two sums above given that k ^ I. Note that we always have \\P{4>k) — P{'4^k)\\2,2 < ^k 
and ||P((/>fc) — P(V’fc)||F < \/2efc- Thus we get for the sum over I, 

Y, wm, kYjWi < E + \\[pm - pi^mh? 

j&I 16 / 

= E iWQi^iMjh + \\[pm - pii^kmjhf 

16 / 

< E (WQi^jMjh + \\P{4>k) - Pmh,2f < E (^1 + ^ 45e2, (116) 

16/ le/ 

and for the sum over the complement P, 

Y\\ni,kmi = \\T{iY)^i4l<\\ni,ml\\^^^^^^^ ( 117 ) 

To estimate the noise term in ( 115| l we use the singular value decomposition of T{I,k) = UDV*, 

E (||r(/, k)r\\l) = E (II WVIli) = (^E ^ E (118) 

where for the inequality we have used that for a subgaussian vector r with parameter p, the marginal 
{vi,r) is subgaussian with parameter p. Substituting these estimates together with the bound \\T{I,k)\\F < 
||P(T>/) - P(T'/)||i;’ + ^/2ek into (|115[) we get. 


E(||7;||^)<Ep x(/,A:) 


472,56 + 


2 , f ^(1 - 72,s) , 2 


V K-S 


+ pA (\\Pi^i)-P{^i)\\F + V 2 eky 


(119) 


As for the estimation of E(ti) we now split the expectation over p into the well and the ill-conditioned 
subsets I = p“^(§). By Lemma A.2 in Il39l . whenever 5($/) < (5o, we have 


||P($z)-P(vI/,)|||.< 


2\\Qi^i)Bi\\] 


\/l - <^o (\/l - <^o - 2||-B/||f) 


( 120 ) 


which for e < “ 1/^ (resp. Ss < 1/2) simplifies to ||P(<1>7) — P(T' 7 )|||, < Together with 
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the general estimate ||P(<&/) — P(^'/)||i? < V^, this leads to 


Edklli) < I 


A 2,1 -^(1 72,s) 2 
472 ,se + ( _g + P 




BSe + \/ 2 efcJ 

B{1 - 72,s) ^ 1^25 + 2skVS 


K-S 


S 

< — 
- K 


472 ,+ 15e^ 


SB 

K-S 


(1 - 72,s) + Sp‘ 


+ P > 1 : |I| = fc e (1 - ^2,5 + dp^) 

Substituting the probability bound from ( 113| l and assuming again that S < as well as that S < 
leads to the final estimate 


S 


lEdlt-lli) < ^ + dp^) 

Last we bound the norm of v in general as 

ll^lb = \\[P{^i) - P{^i) - pm + Pm]y \\2 < 2\\yh < 2VBTi. 

In case 72,5 = 1, p = 0 and therefore y = ^jxj this reduces to 

Ibib < \\[^I - P('I'/)<l>/||F|k/||2 + ||-P(</>fc) - Pii’k)\\2,2\\^lXl\\2 

< Ui - + eVb < e (Vs + Vb') , 

and in case of uniform isometry constant (is(‘I’) < 1/4 and e - 32v/S 

ll^^lb < ||[P(4>/) - P(4/,)||F||y||2 + \\pm - m)l|2,2||y||2 < eVbTi (V^+l 

Putting all the pieces together we get that under the assumptions in (a), 

™ f 1 


( 121 ) 

( 122 ) 


(123) 


(124) 


N 


^[R°{^,yn,k)-R°{^,yn,k)] 


> ^^^^'^(0.381e + f) 


K 


< exp — 


< exp — 


iC'r7l,S^ . 
8 iL ““ 

tchm 


iC'r7l„ 


1 


5 [5e2 + (1 _ ^3,5 + dp'^) /32] ’ SVB + 1 

t 3 


40iL max{5, B + 1} 


mm 


e2 +£5 (1 - 72,s + (ip2) /I6O’ 5 


1 

+ 4 


under the assumptions in (b), 
„ / 1 


N 


^[R°i^,yn,k)-R’'{<^,yn,k)] 


> (0.38l£ + t) 


K 


^ / tCr7i,sN 

< exp--min 


8K 


tCrJl^S 1 

4e2S ’ 8eVS{B + 1) 




tChlsN 


< exp — 


32eK maxis', B + 1} 


min I -, 1 I + 


1 


and under the assumptions in (c), 

„ f 1 


N 


Y,[R%^,yn,k)-R°i^,yn,k)] > ^(0.611e + t) 


K 


< exp — 




AOeK maxis', B + 1} 


t 


min I 1 f + 7 • 


1 
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□ 

Remark B.4. For the lemma we have used both the sign and the permutation invariance, the sign invariance 
in (1061 and the permutation invariance in ( 107| |. As for Lemma ( |B.5|| bu t with a lot more effort, we can use 
the permutation invariance instead of using the sign invariance in ( |106| |. We will not go into details but via 
expanding the sum T{I, k) approximating P{"^i) ~ and keeping track of how often 

an atom (pj is in the support I one can show that as long as 5^ < A we still have ||i?(i ;)||2 < e • Cr^x^sjP 
which is the necessary ingredient for the convergence proof. An alternative criterion, that trades off 
permutation invariance for sign invariance, is again the one discussed in Remark B.2 However it is not 
enough to preserve Eq. ( 107| |, where we need that ^*jbk \\2 < £• For this inequality we do not 

only need to avoid that two atoms (pj and are always used in the same ratio, but that they are always 
used together no matter the ratio, because any two atoms and (j)k which span the same subspace have 
the same approximation properties. Indeed if x{j) and x{k) are both randomly ±.l/^/S then = (f)j + cjrk 
and — 4^k actually provide sparser approximations. 

Sublemma B.9. Let <!>/ be a subdictionary of ^ with (5(4>/) < cio 4'/ the corresponding subdictionary of an 
e-perturbation of 'k, that is d{^, ^) = e. If k e I then 


[P{fk) - F’(^/)] 4>k = P{^i)bk + hiM 'With 


li,kh < 


2ey/S 


Y(l-5o)(l-f)-2eV:S 

Proof: If < do we can use the expression for developed in Lemma A.2 of I1391I . 

Pi^i) = ($/ + I Is + ) (^/ + Qi‘^i)BiMiy, 


+ e\ -Wbkh. (125) 


2=1 


with Mi = Is+ '^{-^\Biy and Ri = MfB*iQ{^i)BiMi{^*i^j 




(126) 


2=1 


to get P{'fk)4>k = al{(pk + bk) and 

P(^ 7)4 = {<I>i + Q{<I>i)BiMi){^}<I>ir^ + 

OO 

= fk + Q{<Li)BiMii^}<I>i)-^<L}fk + (^7 + Q{^i)BiMi) ($J$/)-^ ^{-Riy<I>}fk 

2=1 

/ OO \ OO 

= fk + Q{^i)Bi Is + Y,i-'^\Biy ek\i + (‘k/ + Q{<I>i)BiMi){<I>}^i)-^ 


2=1 


2=1 


fk + bk- P{^i)bk + Q{<^i)Bi f2{-^\Biyek\i + (‘k/ + Q{<Li)BiMi) (<k)$,)-i ^(-i2,)*<k^4. 


2=1 


2=1 


Subtracting the projections we see that all that remains to do is to estimate the size of 


m,k := Q{<I>i)BiMi{^\Bi)ekii - ((4>J)* + Qi<I>i)BiMi{<I>}<I>i)-y (127) 

i=i 

Using standard bounds for matrix vector products and the identity ||(<k^<k 7 )“^|| 2,2 = ||‘k|||| 2 we get 


I,kh < WBlMlhMl^lbkh + (||<k|||2,2 + WBlMihMl^^Uh) 11^/112,211^/^^^4112 + 


i=0 


Wt 


Oik 


cxj 

< \\BlMih,2\\<I>\\\2,2\M2 + (|I‘F1||2,2 + \\BlMih,2\\^\\\l^) (141111211^/^/111,2)* \\Rl^M2 


+ 


OJ 


i=0 


Oik 


(M 
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We next expand remembering the definition of Rj and Mj as 




/ OQ 

(is+y: 


V 




2=1 


M^B^Qi^j) Id + Biek\i = MlB}Q{<!>i) Id + h 


2 = 1 


2=1 


to get 


\\Rl^}4>k\\2 < 11-6/^/112,2 ( 1 - ||6/||2,2||‘J’|||2,2) ll^fclb- 


-1 


Substituting this estimate together with the bound ||M/|| 2,2 < (^1 — || 6 /|| 2 , 2 ||‘f*/|| 2 , 2 j into the above 
bound for ||///,a:|| 2 / resolving the sums and fractions and noting that || 6 fc ||2 = ^ leads to. 


-1 


?/,fc||2 < 


2II6/II 


2,2 


<i>l||2-l-2||i?/||2,2 


+ 1 ‘ ll^fclb- 


To get to the final statement we use the bounds || 6/||2 2 < l| 6 /|||' < Ss‘^/{1 — e^/ 2 ) and ||<h |||2 2 ^ 

Vl - ^i^i) > VT^- ’ ’ □ 

Lemma B.IO. If for two vectors ip, f, where \\<f )\\2 = 1, and two scalars 0 < t < s we have, \\ip — s(j )\\2 < f^ then 

|2 


Ip 


< 2 -2Wl- 


(128) 

Proof: Writing ip = acp + ujz for some unit norm vector 2: with {z, (p) = 0 we can reformulate the 
initial constraint \\ip — s(p \\2 < to {a — s)^ +a;^ < while the quantity whose maximal size we have to 
estimate becomes 

VO + uj^ 

Solving the resulting maximisation problem we get that the maximum is attained at a = and 

(jj = and that therefore 


Ip 


Ip 


< 2 - 2a/1 - 


(130) 

□ 


References 

[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via 
alternating minimization. In COLT 2014 (arXiv:1310.7991), 2014. 

[2] A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In COLT 2014 
(arXiv:1309.1952), 2014. 

[3] M. Aharon, M. Elad, and A.M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse 
representation. IEEE Transactions on Signal Processing., 54(ll):4311-4322, November 2006. 

[4] S. Arora, A. Bhaskara, R. Ge, and T. Ma. More algorithms for provable dictionary learning. arXiv:1401.0579, 2014. 

[5] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. In COLT 2015 
(arXiv:1503.00778), 2015. 

[6] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT 2014 
(arXiv:1308.6273), 2014. 

[7] B. Barak, J.A. Kelner, and D. Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In 
STOC 2015 (arXiv:1407.1543), 2015. 

[8] G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical 
Association, 57(297);33M5, March 1962. 

[9] T. Blumensath and M.E. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 
14(5-6):629-654, 2008. 



33 


[10] T. Blumensath and M.E. Davies. Iterative Hard Thresholding for compressed sensing. Applied Computational Harmonic 
Analysis, 27(3):265-274, 2009. 

[11] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles; exact signal reconstruction from highly incomplete 
frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006. 

[12] O. Christensen. An Introduction to Frames and Riesz Bases. Birkhauser, 2003. 

[13] D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of 
noise. IEEE Transactions on Information Theory, 52(1):6-18, January 2006. 

[14] K. Engan, S.O. Aase, and J.H. Husoy. Method of optimal directions for frame design. In ICASSP99, volume 5, pages 
2443-2446, 1999. 

[15] D.J. Eield and B.A. Olshausen. Emergence of simple-cell receptive field properties by learning a sparse code for natural 
images. Nature, 381:607-609, 1996. 

[16] S. Eoucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 
49(6):2543-2563, 2011. 

[17] Q. Geng, H. Wang, and J. Wrighf. On fhe local correcfness of -minimization for dictionary learning. arXiv:1101.5672, 
2011 . 

[18] R Georgiev, E.J. Theis, and A. Cichocki. Sparse component analysis and blind source separation of underdetermined 
mixtures. IEEE Transactions on Neural Networks, 16(4);992-996, 2005. 

[19] R. Gribonval, R. Jenatton, and E Bach. Sparse and spurious: dictionary learning with noise and outliers. IEEE Transactions 
on Information Theory, 61(11):6298-6319, 2015. 

[20] R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert. Sample complexity of dictionary learning and other 
matrix factorizations. IEEE Transactions on Information Theory, 61(6);3469-3486, 2015. 

[21] R. Gribonval and K. Schnass. Dictionary identifiability - sparse matrix-factorisation via -minimisation. IEEE Transactions 
on Information Theory, 56(7):3523-3539, July 2010. 

[22] D. Gross. Recovering low-rank matrices from few coefticienfs in any basis recovering low-rank matrices from few coefficients 
in any basis recovering low-rank matrices from few coefticienfs in any basis. IEEE Transactions on Information Theory, 
57(3);1548-1566, 2011. 

[23] E. Hock. Hard fhresholding pursuif for sparse approximation. BSc thesis. University of Innsbruck, 2016. 

[24] P. Jain, A. Tewari, and P. Kar. On iferative hard thresholding methods for high-dimensional m-estimation. In NIPS14 
(arXiv:14105137), 2014. 

[25] A. Jung, Y. Eldar, and N. Gortz. Performance limits of dictionary learning for sparse coding. In EUSIPC014 (arXiv:1402.4078), 
pages 765 - 769, 2014. 

[26] K. Kreutz-Delgado, J.P. Murray, B.D. Rao, K. Engan, T. Lee, and T.J. Sejnowski. Dictionary learning algorithms for sparse 
representation. Neural Computations, 15(2):349-396, 2003. 

[27] K. Kreutz-Delgado and B.D. Rao. FOCUSS-based dictionary learning algorithms. In SPIE 4119, 2000. 

[28] R. Kueng and D. Gross. Ripless compressed sensing from anisotropic measurements. Linear Algebra and its Applications, 
441:110-123, 2014. 

[29] M. Ledoux and M. Talagrand. Probability in Banach spaces. Isoperimetry and processes. Springer-Verlag, Berlin, Heidelberg, 
NewYork, 1991. 

[30] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computations, 12(2):337-365, 2000. 

[31] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 
34(4);791-804, 2012. 

[32] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine 
Learning Research, 11:19-60, 2010. 

[33] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. IMA 
Preprinf Series 2212, University of Minnesota, 2008. 

[34] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 
56(ll):5839-5846, 2010. 

[35] N.A. Mehta and A.G. Gray. On the sample complexity of predictive sparse coding. arXiv:1202.4050, 2012. 

[36] V. Naumova and K. Schnass. Dictionary learning from incomplete data. Part I algorithms, in preparation, 2016. 

[37] M.D. Plumbley. Dictionary learning for £i-exact sparse coding. In M.E. Davies, C.J. James, and S.A. Abdallah, edifors. 
International Conference on Independent Component Analysis and Signal Separation, volume 4666, pages 406M13. Springer, 2007. 

[38] R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 
98(6):1045-1057, 2010. 

[39] K. Schnass. On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Applied 
Computational Harmonic Analysis, 37(3);464M91, 2014. 

[40] K. Schnass. Local identification of overcomplefe dictionaries. Journal of Machine Learning Research (arXiv:1401.6354), 
16(Jun):1211-1242, 2015. 

[41] K. Schnass. Sequential dictionary learning with model selection, in preparation, 2016. 

[42] K. Schnass and P. Vandergheynst. Average performance analysis for fhresholding. IEEE Signal Processing Letters, 14(11);828- 
831, 2007. 

[43] K. Skreffing and K. Engan. Recursive leasf squares dictionary learning algorifhm. IEEE Transactions on Signal Processing, 
58(4):2121-2130, April 2010. 

[44] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In COLT 2012 (arXiv:1206.5882), 2012. 

[45] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere. In ICML 2015 (arXiv:1504.06785), 2015. 

[46] J.A. Tropp. On the conditioning of random subdicfionaries. Applied Computational Harmonic Analysis, 25(1-24), 2008. 



34 


[47] D. Vainsencher, S. Manner, and A.M. Bruckstein. The sample complexity of dictionary learning, journal of Machine Learning 
Research, 12(3259-3281), 2011. 

[48] M. Yaghoobi, T. Blumensath, and M.E. Davies. Dictionary learning for sparse approximations with the majorization method. 
IEEE Transactions on Signal Processing, 57(6):2178-2191, June 2009. 

[49] M. Zibulevsky and B.A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural 
Computations, 13(4):863-882, 2001. 



